# Efficient Inference Optimization

## AM Thinking V1 GGUF
**Author:** Mungert · **License:** Apache-2.0 · **Downloads:** 1,234 · **Likes:** 1 · **Tags:** Large Language Model, Transformers

AM-Thinking-v1 is a text-generation model distributed in GGUF format, suitable for a variety of natural language processing tasks.
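For reference, a minimal local-inference sketch with llama-cpp-python; the repo id and quantization filename below are assumptions, so match them to the files actually published in the repository:

```python
# Minimal local inference with a GGUF model via llama-cpp-python.
# Assumes `pip install llama-cpp-python huggingface_hub`; the filename
# glob below is hypothetical -- pick a quantization the repo actually ships.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Mungert/AM-Thinking-v1-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",                # glob for a mid-size quant
    n_ctx=4096,                             # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```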
## Qwen3 235B A22B Exl3
**Author:** MikeRoz · **License:** Apache-2.0 · **Downloads:** 37 · **Likes:** 4 · **Tags:** Large Language Model

An Exllamav3-quantized version of Qwen3-235B-A22B, offering multiple quantization options to balance model size against performance.
## Llama 3.1 Nemotron Nano 4B V1.1
**Author:** nvidia · **License:** Other · **Downloads:** 5,714 · **Likes:** 61 · **Tags:** Large Language Model, Transformers, English

Llama-3.1-Nemotron-Nano-4B-v1.1 is a compressed and optimized large language model derived from Llama 3.1, focused on reasoning and dialogue tasks, supporting a 128K context length, and able to run on a single RTX GPU.
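A minimal loading sketch with the transformers library; the Hugging Face model id is inferred from the entry, and the dtype/device settings are assumptions sized for a single consumer GPU:

```python
# Loading a long-context causal LM on a single GPU with transformers.
# The model id is inferred from the entry; dtype/device settings are
# assumptions sized for a single consumer RTX card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"  # assumed HF id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs fp32
    device_map="auto",           # place layers on the available GPU
)

inputs = tok("Explain KV-cache reuse briefly.", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```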
## Falcon H1 34B Instruct
**Author:** tiiuae · **License:** Other · **Downloads:** 2,454 · **Likes:** 28 · **Tags:** Large Language Model, Transformers

Falcon-H1 is an efficient hybrid-architecture language model developed by TII, combining the strengths of Transformer attention and Mamba state-space layers, and supporting English as well as multilingual tasks.
## Falcon H1 34B Base
**Author:** tiiuae · **License:** Other · **Downloads:** 175 · **Likes:** 7 · **Tags:** Large Language Model, Transformers, Multilingual

Falcon-H1 is a hybrid-architecture language model developed by the UAE's Technology Innovation Institute, combining the strengths of the Transformer and Mamba architectures and supporting multilingual processing.
## Open Thoughts OpenThinker2 7B GGUF
**Author:** bartowski · **License:** Apache-2.0 · **Downloads:** 1,023 · **Likes:** 5 · **Tags:** Large Language Model

A quantized version of OpenThinker2-7B produced with llama.cpp, suitable for text-generation tasks.
## Nemotron H 8B Base 8K
**Author:** nvidia · **License:** Other · **Downloads:** 5,437 · **Likes:** 38 · **Tags:** Large Language Model, Transformers, Multilingual

NVIDIA Nemotron-H-8B-Base-8K is a large language model that generates completions for a given text prefix. It uses a hybrid architecture composed primarily of Mamba-2 and MLP layers with only four attention layers, supports an 8K context length, and covers multiple languages, including English, German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, and Chinese.
## Llama 3.1 Nemotron Nano 8B V1
**Author:** nvidia · **License:** Other · **Downloads:** 60.52k · **Likes:** 145 · **Tags:** Large Language Model, Transformers, English

A reasoning and dialogue model optimized from Meta's Llama-3.1-8B-Instruct, supporting a 128K context length and balancing efficiency with performance.
## Gemma 3 12b It Q5 K S GGUF
**Author:** NikolayKozloff · **Downloads:** 16 · **Likes:** 1 · **Tags:** Large Language Model

A GGUF-quantized version of Google's Gemma 3 12B instruction-tuned model, suitable for local inference and text-generation tasks.
## Gemma 3 27b It Q4 K M GGUF
**Author:** paultimothymooney · **Downloads:** 299 · **Likes:** 2 · **Tags:** Large Language Model

A GGUF-format conversion of Google's Gemma 3 27B IT model, suitable for local inference.
## Bge Reranker V2 M3 Q8 0 GGUF
**Author:** pqnet · **License:** Apache-2.0 · **Downloads:** 54 · **Likes:** 0 · **Tags:** Text Embedding, Other

A GGUF-format text-ranking model converted from BAAI/bge-reranker-v2-m3, supporting multilingual reranking inference.
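How cross-encoder reranking is typically run, sketched against the upstream BAAI/bge-reranker-v2-m3 checkpoint with transformers (running the GGUF conversion itself would instead go through llama.cpp):

```python
# Cross-encoder reranking: score (query, passage) pairs jointly.
# Sketch uses the upstream BAAI checkpoint via transformers; higher
# logit means the passage is more relevant to the query.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "BAAI/bge-reranker-v2-m3"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

pairs = [
    ["what is a reranker?", "A reranker scores query-document pairs jointly."],
    ["what is a reranker?", "GGUF is a binary file format for llama.cpp."],
]
with torch.no_grad():
    inputs = tok(pairs, padding=True, truncation=True, return_tensors="pt")
    scores = model(**inputs).logits.squeeze(-1)  # one relevance score per pair
print(scores.tolist())
```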
## Formatclassifier
**Author:** WebOrganizer · **Downloads:** 2,429 · **Likes:** 5 · **Tags:** Text Classification, Transformers, Other

The FormatClassifier model categorizes web content into 24 format classes based on the page URL and text content.
## Topicclassifier
**Author:** WebOrganizer · **Downloads:** 2,288 · **Likes:** 9 · **Tags:** Text Classification, Transformers, Other

A topic-classification model fine-tuned from gte-base-en-v1.5, capable of classifying web content into 24 topic categories.
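Both WebOrganizer classifiers score the URL and page text together. A hedged sketch; the exact input template and the trust_remote_code requirement are assumptions recalled from the model cards, so verify against them before relying on this:

```python
# Classifying a web page into one of 24 topic classes.
# The "{url}\n\n{text}" input template and trust_remote_code flag are
# assumptions taken from memory of the WebOrganizer model cards.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "WebOrganizer/TopicClassifier"  # FormatClassifier works the same way
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True
).eval()

page = "http://example.com/llm-inference\n\nA post on speeding up LLM inference..."
with torch.no_grad():
    logits = model(**tok(page, return_tensors="pt", truncation=True)).logits
pred = logits.argmax(-1).item()
print(model.config.id2label[pred])  # a category label, if the config defines id2label
```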
## Plamo 2 8b
**Author:** pfnet · **License:** Other · **Downloads:** 401 · **Likes:** 19 · **Tags:** Large Language Model, Transformers, Multilingual

PLaMo 2 8B is an 8-billion-parameter hybrid-architecture language model developed by Preferred Elements, supporting English and Japanese text generation.
## Plamo 2 1b
**Author:** pfnet · **License:** Apache-2.0 · **Downloads:** 1,051 · **Likes:** 31 · **Tags:** Large Language Model, Transformers, Multilingual

PLaMo 2 1B is a 1-billion-parameter model developed by Preferred Elements, pretrained on English and Japanese datasets and featuring a hybrid architecture that combines Mamba with sliding-window attention.
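Because the hybrid Mamba/sliding-window stack ships as custom modeling code, loading PLaMo 2 through transformers requires trust_remote_code; a minimal generation sketch, where the model id is an assumption:

```python
# Generating text from a hybrid Mamba + sliding-window-attention model.
# PLaMo 2 ships custom modeling code, hence trust_remote_code=True;
# the model id below is an assumed HF id for the 1B checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pfnet/plamo-2-1b"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

ids = tok("Preferred Elements develops", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=40)[0], skip_special_tokens=True))
```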
## Modernbert Large Squad2 V0.1
**Author:** Praise2112 · **License:** Apache-2.0 · **Downloads:** 19 · **Likes:** 2 · **Tags:** Question Answering System, Transformers

A question-answering model fine-tuned from ModernBERT-large on the SQuAD 2.0 dataset, supporting long-context processing.
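A minimal extractive-QA sketch using the transformers pipeline; the model id is inferred from the entry, and handle_impossible_answer covers the unanswerable questions SQuAD 2.0 introduces:

```python
# Extractive question answering with a SQuAD-2.0-style model.
# The model id is inferred from the entry. SQuAD 2.0 includes
# unanswerable questions; handle_impossible_answer lets the pipeline
# return an empty answer when the span isn't in the context.
from transformers import pipeline

qa = pipeline("question-answering", model="Praise2112/ModernBERT-large-squad2-v0.1")

result = qa(
    question="What format does llama.cpp use?",
    context="llama.cpp loads models stored in the GGUF binary format.",
    handle_impossible_answer=True,
)
print(result["answer"], result["score"])
```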
## Ichigo Llama3.1 S Instruct V0.4 GGUF
**Author:** mradermacher · **License:** Apache-2.0 · **Downloads:** 369 · **Likes:** 1 · **Tags:** Large Language Model, English

Statically quantized versions of Menlo/Ichigo-llama3.1-s-instruct-v0.4, offered in multiple quantization levels to suit different hardware.
## Deepseek V2 Lite
**Author:** ZZichen · **Downloads:** 20 · **Likes:** 1 · **Tags:** Large Language Model, Transformers

DeepSeek-V2-Lite is a cost-efficient Mixture-of-Experts (MoE) language model with 16B total parameters, of which 2.4B are active per token, supporting a 32K context length.
## Meta Llama 3 8B Instruct Function Calling Json Mode
**Author:** hiieu · **Downloads:** 188 · **Likes:** 75 · **Tags:** Large Language Model, Transformers, English

A model fine-tuned from meta-llama/Meta-Llama-3-8B-Instruct, specifically for function calling and JSON mode.
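The general function-calling pattern this kind of fine-tune targets: advertise a tool as a JSON schema, then parse the model's JSON reply. The prompt template this particular model was trained on is an assumption; check the model card for its canonical format:

```python
# Generic function-calling flow: advertise a tool schema, parse JSON back.
# The exact prompt template this fine-tune expects is an assumption;
# consult the model card for the canonical format.
import json

tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
system = f"You may call this function by replying with JSON:\n{json.dumps(tool)}"

# ...send `system` plus the user message to the model, then parse its reply:
reply = '{"name": "get_weather", "arguments": {"city": "Abu Dhabi"}}'  # example output
call = json.loads(reply)
assert call["name"] == "get_weather"
print(call["arguments"]["city"])
```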
## Minicpm MoE 8x2B
**Author:** openbmb · **Downloads:** 6,377 · **Likes:** 41 · **Tags:** Large Language Model, Transformers

MiniCPM-MoE-8x2B is a Transformer-based Mixture-of-Experts (MoE) language model with 8 expert modules, of which each token activates 2.
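The "each token activates 2 of 8 experts" routing can be illustrated in a few lines of PyTorch. This is an illustrative top-2 router, not MiniCPM's actual implementation:

```python
# Illustrative top-2 mixture-of-experts routing (not MiniCPM's code):
# each token picks its 2 highest-scoring experts out of 8 and mixes
# their outputs with softmax-renormalized gate weights.
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                     # x: (tokens, d_model)
        scores = self.gate(x)                 # (tokens, n_experts)
        w, idx = scores.topk(self.k, dim=-1)  # top-2 experts per token
        w = w.softmax(dim=-1)                 # renormalize gate weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e      # tokens routed to expert e
                if mask.any():
                    out[mask] += w[mask, slot, None] * expert(x[mask])
        return out

moe = Top2MoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

Only 2 of the 8 expert MLPs run per token, which is why such models report far fewer "active" parameters than total parameters.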
## Decilm 6b Instruct
**Author:** Deci · **License:** Other · **Downloads:** 105 · **Likes:** 134 · **Tags:** Large Language Model, Transformers, English

DeciLM 6B-Instruct is an English-language model designed for short-form instruction following, fine-tuned from DeciLM 6B using LoRA.
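LoRA fine-tuning trains small low-rank adapters on top of a frozen base model; a generic sketch with the peft library, where the hyperparameters and target-module names are illustrative rather than Deci's actual recipe:

```python
# Generic LoRA setup with peft: freeze the base model and train only
# low-rank adapter matrices. Hyperparameters and target-module names
# are illustrative, not the recipe Deci used for DeciLM 6B-Instruct.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Deci/DeciLM-6b", trust_remote_code=True)

config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a small fraction of the 6B weights
```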